Data about hybrid open access uptake is critical. Although policy recommendations have addressed open and transparent workflows in recent years, identifying hybrid open access articles and funding sources remains challenging. In this post, we show how to mine such data from Elsevier using openly available sources and R. Between 2015 and July 2019, Elsevier’s subscription journals published 63,577 hybrid open access articles, representing a share of 2.3% of the overall publication volume of these journals. A data analysis reveals a growing uptake of agreements between Elsevier and funders to cover costs for open access. Not surprisingly, mostly British and Dutch funders sponsor hybrid open access. The German Federal Ministry of Education and Research is also well represented despite the current Elsevier boycott by most universities and research organizations in Germany. Nevertheless, the majority of funding sources is still unknown, raising important questions about the transparency of this publishing model.
In September 2018, cOAlition S, a group of international research funders including the European Commission, announced its widely discussed Plan S. According to its principles, publication fees that may arise when publishing open access should be covered by funders or research organizations. Although surveys suggest that most authors do not pay such fees out of their own pockets, publishers rarely share such evidence. Likewise, not all funding organizations and research institutions disclose open access sponsorship at the article level.
In this blogpost, we present a dataset comprising spending information from Elsevier, a major global publisher, and contrast these figures with the overall publication volume of its subscription-based journals. The resulting dataset is openly available along with the source code.
Methods follow the Hybrid Open Access Journal Dashboard, an interactive analytical application from the SUB Göttingen. Instead of using spending data from the Open APC Initiative, the Elsevier publication fee price list, shared as a PDF document, was used to identify hybrid open access journals. The rOpenSci tabulizer package made it possible to extract the data from this file.
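The extraction step might look roughly like the sketch below. It assumes that `tabulizer::extract_tables()` returns one character matrix per detected table (its default output), that all tables share the same column layout, and that each page repeats a header row starting with "ISSN". The file name, the helper names `read_price_list()` and `combine_tables()`, and the mock tables are hypothetical and only serve as illustration.

```r
# Hypothetical file name; the post does not state where the PDF is stored.
price_list_pdf <- "elsevier_apc_price_list.pdf"

# Stack the per-page matrices and drop repeated header rows.
# Assumption: every table has the same number of columns.
combine_tables <- function(tables, header_cell = "ISSN") {
  rows <- do.call(rbind, tables)
  rows <- rows[rows[, 1] != header_cell, , drop = FALSE]
  as.data.frame(rows, stringsAsFactors = FALSE)
}

# Wrapper around tabulizer; needs the tabulizer package and Java when called.
read_price_list <- function(path) {
  tables <- tabulizer::extract_tables(path)
  combine_tables(tables)
}

# Mock input standing in for two extracted pages (made-up values).
mock_tables <- list(
  matrix(c("ISSN", "0000-0001", "Title", "Journal A"), ncol = 2),
  matrix(c("ISSN", "0000-0002", "Title", "Journal B"), ncol = 2)
)
price_list <- combine_tables(mock_tables)
print(price_list)
```

With the real file, `read_price_list(price_list_pdf)` would replace the mock input. The resulting columns would still need to be named and checked by hand, because automatic table detection in PDFs is rarely perfect.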
Next, the Crossref REST API was queried to discover open access articles published in these journals, as well as to retrieve yearly article volumes for the period 2015 - 2019. Using the rcrossref client, developed and maintained by the rOpenSci initiative, the first API call retrieved all license URLs available per journal. I also drew on facet field counts to obtain the yearly article volume per journal from Crossref. After matching license URLs indicating open access articles, a second API call checked licensing metadata per journal. Here, I excluded delayed open access articles by using the Crossref REST API filters license.url and license.delay. For every immediate open access article, comprehensive Crossref metadata was obtained, including links to full-texts available for text and data mining.
Elsevier provides full-texts in HTML and XML via the Crossref Text and Data Mining Services (Crossref-TDM). The XML representation contains not only the full-text, but also comprehensive metadata, including information about open access sponsorship.
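The two API calls can be sketched with rcrossref as follows. This is a minimal illustration and not the original code: the helper names, the placeholder ISSN, and the paging limits are assumptions. rcrossref spells the Crossref filters with underscores (`license_url`, `license_delay`), and a `license_delay` of 0 days restricts results to articles whose license applied from the publication date onwards.

```r
# Build the filter for immediate (non-delayed) open access articles.
immediate_oa_filter <- function(issn, license_url,
                                from = "2015-01-01", until = "2019-12-31") {
  c(
    issn = issn,
    license_url = license_url,
    license_delay = "0",
    type = "journal-article",
    from_pub_date = from,
    until_pub_date = until
  )
}

# First call: facet counts only. limit = 0 returns no article records.
# facet = "license:*" lists license URLs, "published:*" yearly volumes.
journal_facets <- function(issn, facet = "license:*",
                           from = "2015-01-01", until = "2019-12-31") {
  res <- rcrossref::cr_works(
    filter = c(issn = issn, type = "journal-article",
               from_pub_date = from, until_pub_date = until),
    facet = facet,
    limit = 0
  )
  res$facets
}

# Second call: metadata for immediate open access articles, paged by cursor.
fetch_immediate_oa <- function(issn, license_url) {
  res <- rcrossref::cr_works(
    filter = immediate_oa_filter(issn, license_url),
    cursor = "*", cursor_max = 5000, limit = 1000
  )
  res$data
}

# Inspect the filter without touching the network (placeholder ISSN).
oa_filter <- immediate_oa_filter(
  "0000-0000", "http://creativecommons.org/licenses/by/4.0/"
)
print(oa_filter)
```

The two network helpers are only defined here, not called, so the snippet runs without an internet connection or an installed rcrossref.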
<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
Arts and Humanities Research Council
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>
Snapshot of open access metadata in an Elsevier XML full-text. https://api.elsevier.com/content/article/pii/S1475158518302261
After interfacing the Elsevier full-texts with the crminer package, a client maintained by rOpenSci, open access information was extracted.
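To illustrate how such fields can be pulled out once the XML is at hand, here is a small sketch using the xml2 package instead of crminer. The `<record>` wrapper element and the helper `oa_field()` are hypothetical. The real Elsevier documents are larger and use XML namespaces, which is why the XPath matches on `local-name()`.

```r
library(xml2)

# Snippet from the post, wrapped in a hypothetical <record> root element.
snippet <- '<record>
  <openaccess>1</openaccess>
  <openaccessArticle>true</openaccessArticle>
  <openaccessType>Full</openaccessType>
  <openArchiveArticle>false</openArchiveArticle>
  <openaccessSponsorName>
    Arts and Humanities Research Council
  </openaccessSponsorName>
  <openaccessSponsorType>FundingBody</openaccessSponsorType>
  <openaccessUserLicense>
    http://creativecommons.org/licenses/by/4.0/
  </openaccessUserLicense>
</record>'

doc <- read_xml(snippet)

# Return the trimmed text of the first element with the given local name,
# or NA if the element is absent.
oa_field <- function(doc, name) {
  node <- xml_find_first(doc, sprintf(".//*[local-name()='%s']", name))
  trimws(xml_text(node))
}

oa_info <- data.frame(
  sponsor_name = oa_field(doc, "openaccessSponsorName"),
  sponsor_type = oa_field(doc, "openaccessSponsorType"),
  license = oa_field(doc, "openaccessUserLicense"),
  stringsAsFactors = FALSE
)
print(oa_info)
```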
Furthermore, the first author email address was parsed using pattern matching, assuming that email domains roughly indicate the affiliation of the first or corresponding author at the time of publication, an important data point in open access analytics. Next, I split the email domains into their parts with urltools. To avoid misuse, particularly academic spamming, only email domains are publicly shared.
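A rough sketch of this step is shown below. The regular expression is a deliberately simple assumption and not the pattern used for the dataset, and the sample sentence and address are made up. The domain is split with `urltools::suffix_extract()`, but only if urltools is installed.

```r
# Simple email pattern (an assumption, far from a full RFC 5322 matcher).
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# Return the first email address found in a text, lower-cased, or NA.
extract_first_email <- function(text) {
  m <- regexpr(email_pattern, text)
  if (m == -1) {
    return(NA_character_)
  }
  tolower(regmatches(text, m))
}

# Keep only the part after the "@".
email_domain <- function(email) {
  sub("^.*@", "", email)
}

# Made-up correspondence line for illustration.
sample_text <- "Corresponding author. E-mail address: Jane.Doe@physics.example.ac.uk (J. Doe)."
email <- extract_first_email(sample_text)
domain <- email_domain(email)
print(domain)

# Split the domain into subdomain, domain name, and public suffix.
if (requireNamespace("urltools", quietly = TRUE)) {
  print(urltools::suffix_extract(domain))
}
```

Sharing only `domain` rather than `email` matches the decision described above to publish email domains but not full addresses.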
The resulting dataset comprises the following variables:
library(rmarkdown)
hybrid_df <- readr::read_csv("data/els_hybrid_info_normalized.csv")
paged_table(head(hybrid_df, 10))
and is openly shared via GitHub.
In total, the dataset comprises 63,577 hybrid open access articles from 1,703 hybrid open access journals published between January 2015 and July 2019.
Using this dataset, the share of hybrid open access articles per journal was calculated. To explore variations among journals, Bob Rudis' ggeconodist package was used. The package does a great job replicating the boxplot aesthetics of The Economist magazine.
The figure shows a slow but steady hybrid open access uptake. The median open access proportion was around 3% in the first seven months of 2019. 1,703 of 1,985 subscription journals from Elsevier offering hybrid open access did in fact publish at least one article under this model, corresponding to a share of 86%.
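The underlying arithmetic is simple and can be sketched in base R. The per-journal counts below are toy numbers invented for illustration, and the column names are assumptions. Only the final check uses figures reported in this post.

```r
# Toy counts per journal and year (hypothetical, not taken from the dataset).
counts <- data.frame(
  journal = c("A", "A", "B", "B", "C", "C"),
  year = c(2018, 2019, 2018, 2019, 2018, 2019),
  articles = c(200, 150, 400, 300, 100, 80),
  oa_articles = c(4, 6, 8, 9, 0, 0)
)

# Share of hybrid open access articles per journal and year.
counts$oa_share <- counts$oa_articles / counts$articles

# Median share across journals for each year.
median_share <- vapply(split(counts$oa_share, counts$year), median, numeric(1))

# Proportion of journals with at least one hybrid open access article.
oa_per_journal <- vapply(split(counts$oa_articles, counts$journal), sum, numeric(1))
journal_uptake <- mean(oa_per_journal > 0)

# Reported figures: 1,703 of 1,985 journals published hybrid open access.
reported_uptake <- round(100 * 1703 / 1985)

print(median_share)
print(journal_uptake)
print(reported_uptake)
```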
Elsevier usually requires authors to pay a publication fee, also known as an article processing charge (APC), to publish open access. Many authors make use of funding from grant agencies or academic institutions to cover such fees. To streamline this process, some funding bodies and institutions have agreed on central payment options for affiliated researchers. Elsevier also provides APC waivers.
In most cases (59%), payment notifications were sent to the authors, who paid directly. Elsevier lists a funding body covering the open access publication fee for around one third of articles.
The following interactive visualization lets you browse for funders as disclosed by Elsevier.
Mostly British and Dutch funders sponsored hybrid open access in Elsevier journals. The German Federal Ministry of Education and Research (BMBF) is also well represented despite the current boycott by most universities and research organizations in Germany. Since 2018, the BMBF has financially supported 152 hybrid open access articles that appeared in 110 Elsevier journals, according to the publisher.
In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first or corresponding author, a data point used to delineate open access funding. In the following, a hierarchical, interactive treemap visualizes the distribution of the email domains. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. The size of each rectangle is proportional to the number of hybrid open access articles corresponding to this domain.